Membuat Data Frame
df <- data.frame(store=c("Jakarta","Bandung","Surabaya"), sales=c(1000,2000,3000), revenue=c(150,210,500))
dfclass(df)## [1] "data.frame"
Eksplorasi Data Frame Retail
Membaca data retail dan assign dengan nama retail
getwd()## [1] "E:/workspace/TRAINING ALGORITMA/DAY 1/online/P4DS"
# cara 1 -> relative path
retail <- read.csv(file = "data_input/retail.csv", sep = ",")
retail# cara 2 -> full path
retail2 <- read.csv(file = "E:/workspace/TRAINING ALGORITMA/DAY 1/online/P4DS/data_input/retail.csv", sep = ",")
retail2Inspect our data menggunakan names(), length(), dan dim()
names(retail) #info nama kolom## [1] "Row.ID" "Order.ID" "Order.Date" "Ship.Date" "Ship.Mode"
## [6] "Customer.ID" "Segment" "Product.ID" "Category" "Sub.Category"
## [11] "Product.Name" "Sales" "Quantity" "Discount" "Profit"
length(retail) # jumlah kolom## [1] 15
dim(retail) # dimensi## [1] 9994 15
Investigasi data dengan head() dan tail()
head(x = retail, n = 10) #top ntail(x = retail, n = 2) #least nMelihat struktur data dengan fungsi str()
str(retail)## 'data.frame': 9994 obs. of 15 variables:
## $ Row.ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Order.ID : chr "CA-2016-152156" "CA-2016-152156" "CA-2016-138688" "US-2015-108966" ...
## $ Order.Date : chr "11/8/16" "11/8/16" "6/12/16" "10/11/15" ...
## $ Ship.Date : chr "11/11/16" "11/11/16" "6/16/16" "10/18/15" ...
## $ Ship.Mode : chr "Second Class" "Second Class" "Second Class" "Standard Class" ...
## $ Customer.ID : chr "CG-12520" "CG-12520" "DV-13045" "SO-20335" ...
## $ Segment : chr "Consumer" "Consumer" "Corporate" "Consumer" ...
## $ Product.ID : chr "FUR-BO-10001798" "FUR-CH-10000454" "OFF-LA-10000240" "FUR-TA-10000577" ...
## $ Category : chr "Furniture" "Furniture" "Office Supplies" "Furniture" ...
## $ Sub.Category: chr "Bookcases" "Chairs" "Labels" "Tables" ...
## $ Product.Name: chr "Bush Somerset Collection Bookcase" "Hon Deluxe Fabric Upholstered Stacking Chairs, Rounded Back" "Self-Adhesive Address Labels for Typewriters by Universal" "Bretford CR4500 Series Slim Rectangular Table" ...
## $ Sales : num 262 731.9 14.6 957.6 22.4 ...
## $ Quantity : int 2 3 2 5 2 7 4 6 3 5 ...
## $ Discount : num 0 0 0 0.45 0.2 0 0 0.2 0.2 0 ...
## $ Profit : num 41.91 219.58 6.87 -383.03 2.52 ...
Kolom apa yang tipenya belum sesuai? - order.date -> date - ship.date -> date - ship.mode -> factor - category -> factor - subcategory -> factor - segment -> factor
Dari struktur data di atas, variabel/kolom Order.Date dan Ship.Date perlu diubah ke tipe data yang sesuai. Ubahlah kedua variabel tersebut ke bentuk yang sesuai
Ubahlah Order.Date dan Ship.Date ke tipe date
retail$Order.Date<-as.Date(retail$Order.Date, format = "%m/%d/%y")
retailUbahlah kolom Ship.Date ke dalam tipe date
retail$Ship.Date<-as.Date(retail$Ship.Date, format = "%m/%d/%y")
retailUbahlah variabel kolom Ship.Mode dan Segment ke tipe factor
retail$Ship.Mode <- as.factor(retail$Ship.Mode)
retail$Segment <- as.factor(retail$Segment)
retailSubsetting
Menduplikasi fungsi head dengan subsetting retail [ , ]
objek[baris, kolom]
head(retail)retail[1:6, ] #[<baris>,<kolom>]Subsetlah data ke 1, 5,7,10 untuk kolom Ship.Mode dan Profit
#cara 1
retail[c(1,5,7,10),c(1,5)]#cara 2
retail[c(5:10,25:30),c("Ship.Mode","Profit")]Buanglah variabel Row.ID, Order.ID, Customer.ID, kemudian assign ke object bernama retail_clean. Menggunakan data[,-c()] untuk menghapus kolom tertentu.
#cara 1
retail_clean <- retail[,-c(1,2,6)]
retail_clean#cara 2
retail_clean2<-subset(x=retail, select = -c(1,2,6))
retail_clean2Melihat rangkuman dari data retail_clean menggunakan fungsi summary()
retail_clean$Category<-as.factor(retail_clean$Category)
retail_clean$Sub.Category<-as.factor(retail_clean$Sub.Category)
retail_clean$Product.Name<-as.factor(retail_clean$Product.Name)
summary(retail_clean)## Order.Date Ship.Date Ship.Mode
## Min. :2014-01-03 Min. :2014-01-07 First Class :1538
## 1st Qu.:2015-05-23 1st Qu.:2015-05-27 Same Day : 543
## Median :2016-06-26 Median :2016-06-29 Second Class :1945
## Mean :2016-04-30 Mean :2016-05-03 Standard Class:5968
## 3rd Qu.:2017-05-14 3rd Qu.:2017-05-18
## Max. :2017-12-30 Max. :2018-01-05
##
## Segment Product.ID Category
## Consumer :5191 Length:9994 Furniture :2121
## Corporate :3020 Class :character Office Supplies:6026
## Home Office:1783 Mode :character Technology :1847
##
##
##
##
## Sub.Category Product.Name Sales
## Binders :1523 Staple envelope : 48 Min. : 0.444
## Paper :1370 Easy-staple paper : 46 1st Qu.: 17.280
## Furnishings: 957 Staples : 46 Median : 54.490
## Phones : 889 Avery Non-Stick Binders : 20 Mean : 229.858
## Storage : 846 Staples in misc. colors : 19 3rd Qu.: 209.940
## Art : 796 KI Adjustable-Height Table: 18 Max. :22638.480
## (Other) :3613 (Other) :9797
## Quantity Discount Profit
## Min. : 1.00 Min. :0.0000 Min. :-6599.978
## 1st Qu.: 2.00 1st Qu.:0.0000 1st Qu.: 1.729
## Median : 3.00 Median :0.2000 Median : 8.666
## Mean : 3.79 Mean :0.1562 Mean : 28.657
## 3rd Qu.: 5.00 3rd Qu.:0.2000 3rd Qu.: 29.364
## Max. :14.00 Max. :0.8000 Max. : 8399.976
##
Dari summary di atas, apa informasi yang kita dapatkan?
- rata rata diskon 15,62%
- segmen terbesar ada di consumer
- perngiriman terbanyak pake standart
- profit ada kerugian
Dive Deeper
- Periksalah summary dari data-data yang rugi
summary(retail_clean[retail_clean$Profit<=0,] )## Order.Date Ship.Date Ship.Mode
## Min. :2014-01-04 Min. :2014-01-08 First Class : 303
## 1st Qu.:2015-05-06 1st Qu.:2015-05-09 Same Day : 99
## Median :2016-06-12 Median :2016-06-16 Second Class : 315
## Mean :2016-04-25 Mean :2016-04-29 Standard Class:1219
## 3rd Qu.:2017-05-05 3rd Qu.:2017-05-09
## Max. :2017-12-30 Max. :2018-01-03
##
## Segment Product.ID Category Sub.Category
## Consumer :1038 Length:1936 Furniture :747 Binders :613
## Corporate : 573 Class :character Office Supplies:915 Chairs :255
## Home Office: 325 Mode :character Technology :274 Tables :205
## Storage :185
## Furnishings:176
## Phones :138
## (Other) :364
## Product.Name Sales
## Global Wood Trimmed Manager's Task Chair, Khaki: 12 Min. : 0.444
## Global High-Back Leather Tilter, Burgundy : 10 1st Qu.: 12.859
## KI Adjustable-Height Table : 10 Median : 73.170
## SAFCO Commercial Wire Shelving, Black : 10 Mean : 256.402
## Hon 61000 Series Interactive Training Tables : 9 3rd Qu.: 298.581
## GBC Instant Report Kit : 8 Max. :22638.480
## (Other) :1877
## Quantity Discount Profit
## Min. : 1.000 Min. :0.000 Min. :-6599.978
## 1st Qu.: 2.000 1st Qu.:0.200 1st Qu.: -55.490
## Median : 3.000 Median :0.400 Median : -16.920
## Mean : 3.768 Mean :0.469 Mean : -80.646
## 3rd Qu.: 5.000 3rd Qu.:0.700 3rd Qu.: -5.631
## Max. :14.000 Max. :0.800 Max. : 0.000
##
- Periksalah summary dari data-data yang untung
summary(retail_clean[retail_clean$Profit>0,] )## Order.Date Ship.Date Ship.Mode
## Min. :2014-01-03 Min. :2014-01-07 First Class :1235
## 1st Qu.:2015-05-26 1st Qu.:2015-05-29 Same Day : 444
## Median :2016-06-28 Median :2016-07-02 Second Class :1630
## Mean :2016-05-01 Mean :2016-05-04 Standard Class:4749
## 3rd Qu.:2017-05-15 3rd Qu.:2017-05-19
## Max. :2017-12-30 Max. :2018-01-05
##
## Segment Product.ID Category
## Consumer :4153 Length:8058 Furniture :1374
## Corporate :2447 Class :character Office Supplies:5111
## Home Office:1458 Mode :character Technology :1573
##
##
##
##
## Sub.Category
## Paper :1370
## Binders : 910
## Art : 796
## Furnishings: 781
## Phones : 751
## Accessories: 683
## (Other) :2767
## Product.Name
## Staple envelope : 48
## Easy-staple paper : 46
## Staples : 46
## Staples in misc. colors : 19
## Logitech 910-002974 M325 Wireless Mouse for Web Scrolling: 15
## Avery Non-Stick Binders : 14
## (Other) :7870
## Sales Quantity Discount Profit
## Min. : 0.99 Min. : 1.000 Min. :0.00000 Min. : 0.063
## 1st Qu.: 17.94 1st Qu.: 2.000 1st Qu.:0.00000 1st Qu.: 5.325
## Median : 50.97 Median : 3.000 Median :0.00000 Median : 13.318
## Mean : 223.48 Mean : 3.795 Mean :0.08104 Mean : 54.918
## 3rd Qu.: 191.98 3rd Qu.: 5.000 3rd Qu.:0.20000 3rd Qu.: 40.466
## Max. :17499.95 Max. :14.000 Max. :0.40000 Max. :8399.976
##
Dari hasil summary tersebut, apa insight yang bisa diambil? 1. Sub Kategory plg merugi adalah binder 2. Paling untung adalah paper 3. chair & table perlu dievaluasi penjualan karena tidak menyumbang profit
Subsetting dengan 2 kondisi
Untuk mensubset dengan 2 kondisi dan kedua kondisinya terpenuhi, bisa menggunakan operator & dan
Misal saya ingin periksa untuk barang2 yang untung (profit > 0) dan juga didiskon (discount > 0), kebanyakan dari cetagory apa
summary(retail_clean[retail_clean$Profit>0 & retail_clean$Discount>0,] )## Order.Date Ship.Date Ship.Mode
## Min. :2014-01-03 Min. :2014-01-07 First Class : 540
## 1st Qu.:2015-05-08 1st Qu.:2015-05-12 Same Day : 186
## Median :2016-07-02 Median :2016-07-03 Second Class : 633
## Mean :2016-04-29 Mean :2016-05-03 Standard Class:1931
## 3rd Qu.:2017-05-26 3rd Qu.:2017-05-28
## Max. :2017-12-30 Max. :2018-01-04
##
## Segment Product.ID Category
## Consumer :1705 Length:3290 Furniture : 538
## Corporate :1029 Class :character Office Supplies:2010
## Home Office: 556 Mode :character Technology : 742
##
##
##
##
## Sub.Category
## Binders : 573
## Paper : 513
## Phones : 442
## Art : 298
## Chairs : 229
## Accessories: 212
## (Other) :1023
## Product.Name
## Staples : 23
## Staple envelope : 18
## Easy-staple paper : 14
## Avery Non-Stick Binders : 11
## GBC Premium Transparent Covers with Diagonal Lined Pattern: 9
## Newell 312 : 9
## (Other) :3206
## Sales Quantity Discount Profit
## Min. : 1.344 Min. : 1.00 Min. :0.1000 Min. : 0.153
## 1st Qu.: 15.802 1st Qu.: 2.00 1st Qu.:0.2000 1st Qu.: 3.757
## Median : 47.984 Median : 3.00 Median :0.2000 Median : 9.312
## Mean : 219.140 Mean : 3.78 Mean :0.1985 Mean : 36.942
## 3rd Qu.: 202.784 3rd Qu.: 5.00 3rd Qu.:0.2000 3rd Qu.: 29.262
## Max. :11199.968 Max. :14.00 Max. :0.4000 Max. :3919.989
##
Untu mensubset dengan 2 kondisi tetapi setidaknya salah satu kondisinya terpenuhi, bisa menggunakan operator | atau
summary(retail_clean[retail_clean$Category=="Office Supplies" | retail_clean$Segment=="Consumer",] )## Order.Date Ship.Date Ship.Mode
## Min. :2014-01-03 Min. :2014-01-07 First Class :1223
## 1st Qu.:2015-05-12 1st Qu.:2015-05-16 Same Day : 450
## Median :2016-06-18 Median :2016-06-21 Second Class :1569
## Mean :2016-04-25 Mean :2016-04-29 Standard Class:4848
## 3rd Qu.:2017-05-13 3rd Qu.:2017-05-17
## Max. :2017-12-30 Max. :2018-01-05
##
## Segment Product.ID Category
## Consumer :5191 Length:8090 Furniture :1113
## Corporate :1820 Class :character Office Supplies:6026
## Home Office:1079 Mode :character Technology : 951
##
##
##
##
## Sub.Category Product.Name Sales
## Binders :1523 Staple envelope : 48 Min. : 0.444
## Paper :1370 Easy-staple paper : 46 1st Qu.: 14.700
## Storage : 846 Staples : 46 Median : 41.370
## Art : 796 Avery Non-Stick Binders: 20 Mean : 187.453
## Furnishings: 494 Staples in misc. colors: 19 3rd Qu.: 154.900
## Appliances : 466 Staple remover : 18 Max. :13999.960
## (Other) :2595 (Other) :7893
## Quantity Discount Profit
## Min. : 1.000 Min. :0.0000 Min. :-6599.978
## 1st Qu.: 2.000 1st Qu.:0.0000 1st Qu.: 1.814
## Median : 3.000 Median :0.1500 Median : 7.796
## Mean : 3.791 Mean :0.1573 Mean : 24.756
## 3rd Qu.: 5.000 3rd Qu.:0.2000 3rd Qu.: 24.998
## Max. :14.000 Max. :0.8000 Max. : 6719.981
##
Dive Deeper:
Berapakah jumlah transaksi yang terjadi pada Order.Date tahun 2016 dari data retail_clean?
#cara 1
retail_2016<-retail_clean[retail_clean$Order.Date>="2016-01-01" & retail_clean$Order.Date<"2017-01-01",]
retail_2016#cara 2
retail_clean[format(retail_clean$Order.Date, format="%Y")==2016,]nrow() dan ncol()
nrow(retail_clean) # untuk mengekstrak jumlah rows## [1] 9994
ncol(retail_clean) # untuk mengekstrak jumlah kolom## [1] 12
Mengurutkan data berdasarkan profit terbesar ke terkecil, dan sebaliknya
# mengurutkan dari besar ke kecil
#cara 1
retail_clean[order(retail_clean$Profit, decreasing = TRUE),]#cara 2
retail_clean[order(-retail_clean$Profit),]# mengurutkan dari kecil ke besar
#cara 1
retail_clean[order(retail_clean$Profit, decreasing = FALSE),]#cara 2
retail_clean[order(retail_clean$Profit),]Another Subsetting Operator
Contoh penggunaan %in% untuk mensubset anggota pada variabel tertentu
bulan_genap <- c("February", "April", "June", "August", "October", "December")
bulan_genap## [1] "February" "April" "June" "August" "October" "December"
#step
#1. derived kolom bulan dari order date
retail_cleanretail_clean$bulan <- months(x=retail_clean$Order.Date, abbreviate = F)
retail_clean#2. subset data retail_clean untuk bulan genap
retail_clean[retail_clean$bulan %in% bulan_genap,]#3. subset data retail_clean untuk bulan ganjil
retail_clean[!retail_clean$bulan %in% bulan_genap,]Contoh penggunaan operator ! bukan
retail_clean[!retail_clean$bulan %in% bulan_genap,]Contoh penggunaan operator != tidak sama dengan
retail_clean[!retail_clean$bulan != "Desember",]Frequency Table table() dan proportion table prop.table()
Cek jumlah proporsi dari Ship.Mode, Segment, dan Category
Melihat jumlah tiap-tiap kategori menggunakan table()
seperti fungsi count() pada SQL
# untuk ship.mode
table(retail_clean$Ship.Mode)##
## First Class Same Day Second Class Standard Class
## 1538 543 1945 5968
# untuk segment
table(retail_clean$Segment)##
## Consumer Corporate Home Office
## 5191 3020 1783
# untuk category
table(retail_clean$Category)##
## Furniture Office Supplies Technology
## 2121 6026 1847
Cek jumlah proportion table dari variabel Ship.Mode dengan prop.table()
prop.table(table(retail_clean$Ship.Mode))##
## First Class Same Day Second Class Standard Class
## 0.1538923 0.0543326 0.1946168 0.5971583
Multiple Frequency Table dari Ship.Mode dan Category
table(retail_clean$Ship.Mode, retail_clean$Category)##
## Furniture Office Supplies Technology
## First Class 327 910 301
## Same Day 119 326 98
## Second Class 427 1152 366
## Standard Class 1248 3638 1082
prop.table(table(retail_clean$Ship.Mode, retail_clean$Category),
margin = 1)##
## Furniture Office Supplies Technology
## First Class 0.2126138 0.5916775 0.1957087
## Same Day 0.2191529 0.6003683 0.1804788
## Second Class 0.2195373 0.5922879 0.1881748
## Standard Class 0.2091153 0.6095845 0.1813003
prop.table(table(retail_clean$Ship.Mode, retail_clean$Category),
margin = 2)##
## Furniture Office Supplies Technology
## First Class 0.15417256 0.15101228 0.16296697
## Same Day 0.05610561 0.05409890 0.05305901
## Second Class 0.20132013 0.19117159 0.19815918
## Standard Class 0.58840170 0.60371723 0.58581483
Kalau mau diconvert ke data frame
as.data.frame(prop.table(table(retail_clean$Ship.Mode, retail_clean$Category),
margin = 2))Note: argumen margin = 1 untuk menjumlahkan perbaris, kalau 2 perkolom margin= yang mau di jadiin acuan 100% baris/kolom 1: baris 2: kolom tanpa margin: seluruh dataset
Dari tampilan di atas, apa yang dapat kamu simpulkan? 1. Ship.Mode Standart Class paling banyak digunakan di setiap kategori 2. Office Supplies paling banyak dibeli di seluruh mode shipment
Membuat Variabel baru bernama profitable yang dihasilkan dari jika profit > 0 maka profitable, jika profit <= 0 maka non-profitable
retail_clean$Profitable <- ifelse(test = retail_clean$Profit >0, yes = "Profitable", no = "Non Profitable")Cek kembali data retail setelah penambahan variabel
retail_cleanDive Deeper
Which following segment makes up the most of our “>1000 Sales” transaction? Subset the data for retail$Sales >= 1000 and then use table() with the “Segment” and “Category” variables as its parameters
Among the transactions that ship on “First Class” in
retail_clean, how many of them were office supplies (to two decimal points)?
Pembahasan No. 1
retail_1000 <- retail_clean[retail_clean$Sales >= 1000,]
table(retail_1000$Segment, retail_1000$Category)##
## Furniture Office Supplies Technology
## Consumer 82 55 88
## Corporate 58 45 51
## Home Office 29 22 38
Pembahasan No. 2
retail_2 <- retail_clean[retail_clean$Ship.Mode=="First Class",]
round(prop.table(table(retail_2$Category)),2)##
## Furniture Office Supplies Technology
## 0.21 0.59 0.20
Sample
RNGkind(sample.kind = "Rounding")## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(10)
sample(1:13, 1)## [1] 7
Cross Tabulation
menggunakan xtabs()
xtabs(formula = Profit ~ Ship.Mode+ Category, data= retail_clean)## Category
## Ship.Mode Furniture Office Supplies Technology
## First Class 3066.9474 18400.3291 27502.5634
## Same Day 797.3484 6423.5192 8670.8913
## Second Class 4226.2614 27068.1676 26152.2064
## Standard Class 10360.7156 70598.7849 83129.2870
as.data.frame(xtabs(formula = Profit ~ Ship.Mode+ Category, data= retail_clean))menggunakan plot(xtabs()) untuk melihat visualisasi dari cross tab
plot(xtabs(formula = Profit ~ Ship.Mode+ Category, data= retail_clean))menggunakan aggregate()
retail_agg<-aggregate(Profit ~ Ship.Mode+ Category, data = retail_clean, FUN = length)
retail_agg[order(retail_agg$Profit, decreasing = TRUE),]pilihan FUN: mean, median, sum, min, max, var, sd, etc
Membuat Fungsi
pembagian <- function(x, y){
m <- x / y
m
}
pembagian(x = 30, y = 5)## [1] 6
penjumlahan<-function(val1, val2){
add<-val1+val2
add
}penjumlahan(val1=1,val2=9)## [1] 10
cara install dan load library/package
EXTRA EXERCISE
A 30-minutes exercise in Break Out Room.
Create a new project, name it with cofee_study. Please create a Rmd, .RProj, and copy the coffee_study.txt dataset to the folder coffee_study. Then, answers these questions below:
Calculate the mean age for each combined level of both gender and drink
Calculate the mean age for each gender
How many male who drinks green tea?
Please order the dataset based on its score (Urutkan dari besar ke kecil)
How many people who drinks coffee who has lower than the median age?
After that, knit the .Rmd file to .html and collect the .html file to the classroom task named P4DS Exercise.
Link tambahan
Untuk mencari data yang bisa diakses secara gratis :
Untuk mempelajari lebih lanjut terkait data tanggal menggunakan as.Date
Cara untuk membuat .RProj
Tahapan membuat .RProj ketika belum mempunyai folder untuk .RProj
- Klik menu File
- New Project
- New Directory
- New Project
- Isi directory name
- Specify working directory path yang diinginkan
Tahapan membuat .RProj ketika sudah mempunyai folder untuk .RProj
- Klik menu File
- New Project
- Existing Directory
- Tentukan RProj ingin diletakkan di folder apa (Specify Working Directory)
Beberapa package yang bisa diinstall untuk template R-Markdown
- install.packages(“prettydoc”)
- install.packages(“rmdformats”)
- install.packages(“hrbrthemes”)
Custom tampilan .Rmd bisa dipelajari lebih lanjut di :
- Help - Cheatsheets - RMarkdown Cheatsheet
- https://bookdown.org/yihui/rmarkdown/html-document.html
- Masuk ke rpubs.com untuk referensi laporan yang telah orang lain buat